## 21. Summary
### The Setting, Revisited
- The reinforcement learning (RL) framework is characterized by an agent learning to interact with its environment.
- At each time step, the agent receives the environment's state (the environment presents a situation to the agent), and the agent must choose an appropriate action in response. One time step later, the agent receives a reward (the environment indicates whether the agent has responded appropriately to the state) and a new state.
- Every agent has the goal of maximizing expected cumulative reward, or the expected sum of rewards attained over all time steps.
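As a rough sketch of this interaction loop, here is a minimal Python example; the `ToyEnv` class and the random action choice stand in for a real environment and agent and are invented purely for illustration (they are not part of the lesson).

```python
import random

class ToyEnv:
    """A made-up environment: states 0 through 4, where state 4 is terminal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves one state to the right; action 0 stays put.
        self.state = min(self.state + action, 4)
        reward = 1 if action == 1 else 0   # the environment's feedback signal
        done = (self.state == 4)           # True once the terminal state is reached
        return self.state, reward, done

env = ToyEnv()
state = env.reset()                         # the environment presents the first state
done, cumulative_reward = False, 0
while not done:
    action = random.choice([0, 1])          # the agent responds with an action
    state, reward, done = env.step(action)  # one time step later: reward and new state
    cumulative_reward += reward
print("cumulative reward:", cumulative_reward)
```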
### Episodic vs. Continuing Tasks
- A task is an instance of the reinforcement learning (RL) problem.
- Continuing tasks are tasks that continue forever, without end.
- Episodic tasks are tasks with a well-defined starting and ending point.
- In this case, we refer to a complete sequence of interaction, from start to finish, as an episode.
- Episodic tasks come to an end whenever the agent reaches a terminal state.
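To make the episodic case concrete, the hypothetical `ToyEnv` from the sketch above can be run for several episodes; each episode ends once the terminal state is reached. In a continuing task there would be no terminal state, and the inner loop would never end on its own.

```python
for episode in range(3):
    state, done, steps = env.reset(), False, 0
    while not done:                          # one episode: interact until a terminal state
        state, reward, done = env.step(random.choice([0, 1]))
        steps += 1
    print(f"episode {episode} ended after {steps} steps")
```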
### The Reward Hypothesis
- Reward Hypothesis: All goals can be framed as the maximization of (expected) cumulative reward.
### Goals and Rewards
- (Please see Part 1 and Part 2 to review an example of how to specify the reward signal in a real-world problem.)
### Cumulative Reward
- The return at time step t is G_t := R_{t+1} + R_{t+2} + R_{t+3} + \ldots
- The agent selects actions with the goal of maximizing expected (discounted) return. (Note: discounting is covered in the next concept.)
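For a finite reward sequence, the (undiscounted) return is simply the sum of the rewards; the reward values below are made up for illustration.

```python
def undiscounted_return(rewards):
    """G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... for a finite reward sequence."""
    return sum(rewards)

rewards = [1, 0, 2, 3]               # hypothetical rewards R_{t+1}, R_{t+2}, ...
print(undiscounted_return(rewards))  # 6
```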
### Discounted Return
- The discounted return at time step t is G_t := R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots .
- The discount rate \gamma is something that you set to refine the goal that you have for the agent.
- It must satisfy 0 \leq \gamma \leq 1.
- If \gamma=0, the agent only cares about the most immediate reward.
- If \gamma=1, the return is not discounted.
- For larger values of \gamma, the agent cares more about the distant future. Smaller values of \gamma result in more extreme discounting, where, in the most extreme case, the agent only cares about the most immediate reward.
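A minimal sketch of the discounted return, reusing the same hypothetical reward sequence as above, shows how \gamma controls the weighting of future rewards.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for a finite sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1, 0, 2, 3]
print(discounted_return(rewards, 0.0))  # 1.0 -> only the most immediate reward counts
print(discounted_return(rewards, 0.9))  # ~4.807 = 1 + 0.9*0 + 0.81*2 + 0.729*3
print(discounted_return(rewards, 1.0))  # 6.0 -> the undiscounted return
```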
### MDPs and One-Step Dynamics
- The state space \mathcal{S} is the set of all (nonterminal) states.
- In episodic tasks, we use \mathcal{S}^+ to refer to the set of all states, including terminal states.
- The action space \mathcal{A} is the set of possible actions. (Alternatively, \mathcal{A}(s) refers to the set of possible actions available in state s \in \mathcal{S}.)
- (Please see Part 2 to review how to specify the reward signal in the recycling robot example.)
- The one-step dynamics of the environment determine how the environment decides the state and reward at every time step. The dynamics can be defined by specifying p(s',r|s,a) \doteq \mathbb{P}(S_{t+1}=s', R_{t+1}=r|S_{t} = s, A_{t}=a) for each possible s', r, s, \text{and } a. (A small code sketch appears after this list.)
- A (finite) Markov Decision Process (MDP) is defined by:
- a (finite) set of states \mathcal{S} (or \mathcal{S}^+, in the case of an episodic task)
- a (finite) set of actions \mathcal{A}
- a set of rewards \mathcal{R}
- the one-step dynamics of the environment
- the discount rate \gamma \in [0,1]
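Putting these pieces together, here is a rough sketch of how the one-step dynamics of a tiny (finite) MDP might be written down in tabular form; the states, actions, rewards, and probabilities are invented purely for illustration.

```python
# dynamics[(s, a)] is a list of (next_state, reward, probability) triples,
# i.e. a tabular representation of p(s', r | s, a).
dynamics = {
    ("s0", "a0"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "a1"): [("s1", 1.0, 1.0)],
    ("s1", "a0"): [("s0", 0.0, 0.5), ("s1", 2.0, 0.5)],
    ("s1", "a1"): [("s1", 2.0, 1.0)],
}
gamma = 0.9  # discount rate in [0, 1]

# Sanity check: for each state-action pair, p(s', r | s, a) sums to 1 over (s', r).
for (s, a), outcomes in dynamics.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9
```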